Setup

What is the tidyverse?

The tidyverse consists of a few key packages for data import, manipulation, visualization and more.
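
As a minimal sketch, assuming the tidyverse has already been installed (for example with `install.packages("tidyverse")`), attaching it makes the core packages available in one step:

```r
# Attach the core tidyverse packages in one call
# (assumes the tidyverse has been installed beforehand).
library(tidyverse)

# The data manipulation verbs used later (select, filter, mutate, ...)
# come from dplyr, which is now on the search path.
"package:dplyr" %in% search()
```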

Objects and Classes
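
A brief illustration of the idea, using made-up objects named `x`, `y`, and `z`: everything created in R is an object, and `class()` and `str()` are two common ways to inspect one.

```r
# Three small objects of different kinds (names are illustrative only)
x <- 1:3                       # integer vector
y <- "a"                       # character vector of length one
z <- list(one = x, two = y)    # list holding both

class(x)   # "integer"
class(y)   # "character"
class(z)   # "list"

# str() gives a compact summary of an object's structure
str(z)
```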

Data Structures

Vectors form the basis of R data structures. The two main types are atomic vectors and lists.
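
The difference can be sketched with a few toy objects (all names here are illustrative): an atomic vector holds elements of a single type, while a list can mix types and lengths.

```r
# An atomic vector holds elements of a single type.
my_vector <- c(1, 2, 3)
typeof(my_vector)          # "double"

# Mixing types in c() coerces everything to the most flexible type.
mixed <- c(1, "a", TRUE)
typeof(mixed)              # "character"

# A list can hold elements of different types and lengths.
my_list <- list(a = 1, b = "two", c = c(TRUE, FALSE))
sapply(my_list, typeof)    # a: "double", b: "character", c: "logical"

# Only the first of these is atomic.
is.atomic(my_vector)       # TRUE
is.atomic(my_list)         # FALSE
is.list(my_list)           # TRUE
```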

Data frames

Data frames are a special kind of list in which every element (column) has the same length, and they are probably the most commonly used data structure for data science purposes.
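
A small sketch with toy data shows the list-like nature of a data frame; the column names and values are only for illustration.

```r
# Toy data for illustration
my_data <- data.frame(
  id   = 1:3,
  name = c("Vernon", "Ace", "Cora")
)

class(my_data)     # "data.frame"
is.list(my_data)   # TRUE: a data frame is a list of equal-length columns

# Columns can be extracted with list-style accessors
my_data$name
my_data[["id"]]

# Unlike a generic list, a data frame also has row and column dimensions
dim(my_data)       # 3 2
```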

Importing Data

Importing data is usually the first step of an analysis.
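
The following sketch writes a tiny CSV to a temporary file and reads it back with base R's `read.csv()`, so that it runs without any external data. The file contents and column names are invented for the example; with the tidyverse attached, `readr::read_csv()` could be used in much the same way.

```r
# Write a small CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,age,gender",
    "1,34,female",
    "2,51,male",
    "3,28,female"),
  csv_path
)

# Base R reader; returns a data.frame
demo_data <- read.csv(csv_path)

str(demo_data)
nrow(demo_data)   # 3
```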

Working with Databases

A database must first be connected to, but its tables can otherwise be used much like data frames.
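
One way this might look, assuming the DBI, RSQLite, and dbplyr packages are installed alongside dplyr, is a temporary in-memory SQLite database. The table name `demos` and the toy data are assumptions made for the sketch.

```r
library(dplyr)   # tbl(), copy_to(), collect(); database support relies on dbplyr

# Connect to a temporary in-memory SQLite database
# (assumes the DBI and RSQLite packages are installed).
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a toy data frame into the database as a table named "demos"
toy <- data.frame(id = 1:4, age = c(25, 38, 47, 61))
copy_to(con, toy, "demos")

# Refer to the remote table, then use the usual verbs on it
demos_db <- tbl(con, "demos")

under_40 <- demos_db %>%
  filter(age < 40) %>%
  collect()          # run the query and pull the result into R

under_40

DBI::dbDisconnect(con)
```

The verbs are translated to SQL behind the scenes, and nothing is brought into R until `collect()` is called.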

Selecting Columns

A common step is to subset the data by column.
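
A short sketch of column selection with dplyr's `select()`, using a toy data frame whose column names are hypothetical:

```r
library(dplyr)

# Toy data; column names are made up for the example
toy <- data.frame(
  id          = 1:3,
  gender      = c("F", "M", "F"),
  age         = c(34, 51, 28),
  award_total = c(1000, 2500, 400),
  award_count = c(1, 3, 1)
)

by_name   <- toy %>% select(gender, age)            # keep named columns
dropped   <- toy %>% select(-id)                    # drop a column
by_prefix <- toy %>% select(starts_with("award"))   # match by name prefix

colnames(by_name)     # "gender" "age"
colnames(dropped)     # "gender" "age" "award_total" "award_count"
colnames(by_prefix)   # "award_total" "award_count"
```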

Filtering Rows

To filter data, think of a logical statement, something that can be TRUE or FALSE. Rows for which the statement is TRUE are kept.
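
The logical-statement idea can be sketched with dplyr's `filter()` on toy data (the columns `age` and `libuser` here are illustrative):

```r
library(dplyr)

toy <- data.frame(
  id      = 1:5,
  age     = c(25, 38, 47, 61, NA),
  libuser = c(1, 0, 1, 1, 0)
)

# The condition itself is just a logical vector
toy$age < 40          # TRUE TRUE FALSE FALSE NA

# filter() keeps rows where the condition is TRUE; FALSE and NA rows are dropped
young <- toy %>% filter(age < 40)
users <- toy %>% filter(libuser == 1)

# Conditions separated by commas are combined with AND
both <- toy %>% filter(age < 40, libuser == 1)

nrow(young)   # 2
nrow(users)   # 3
nrow(both)    # 1
```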

Generating new data

Another very common data processing task is to generate new variables.
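
As one possible illustration, `mutate()` from dplyr can add a standardized version of an existing column. The data and the name `age_std` are assumptions for the sketch.

```r
library(dplyr)

toy <- data.frame(id = 1:4, age = c(20, 30, 40, NA))

# Create a standardized (z-score) version of age, ignoring missing values
toy <- toy %>%
  mutate(age_std = (age - mean(age, na.rm = TRUE)) / sd(age, na.rm = TRUE))

toy$age_std   # -1  0  1 NA
```

Existing columns are left untouched; the new variable is appended to the data frame.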

Renaming columns
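
A minimal sketch of renaming with dplyr, assuming dplyr 1.0 or later for `rename_with()`; the column names are hypothetical.

```r
library(dplyr)

toy <- data.frame(id = 1:3, new_age = c(-1, 0, 1))

# rename() takes new_name = old_name
toy <- toy %>% rename(age_std = new_age)
colnames(toy)                                    # "id" "age_std"

# rename_with() applies a function to the column names
upper_names <- toy %>% rename_with(toupper) %>% colnames()
upper_names                                      # "ID" "AGE_STD"
```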

Merging

Merging data can take a variety of forms and, depending on the data, can be quite complicated.
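
Two of the most common forms, a left join and an inner join, can be sketched with dplyr on toy tables. The tables and the key column `id` are invented for the example.

```r
library(dplyr)

demos <- data.frame(id = 1:4, age = c(25, 38, 47, 61))
ids   <- data.frame(id = c(2L, 4L, 5L), dept = c("math", "bio", "art"))

# Left join: keep every row of demos, filling unmatched columns with NA
left <- left_join(demos, ids, by = "id")

# Inner join: keep only rows whose key appears in both tables
inner <- inner_join(demos, ids, by = "id")

nrow(left)    # 4
nrow(inner)   # 2
left$dept     # NA "math" NA "bio"
```

Naming the key explicitly with `by` avoids relying on dplyr guessing the shared columns.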

Exercises

Selecting and filtering

Use the : operator to select successive columns.

Filter the data to award amounts less than 500000.

Generating new data

Generate a new award amount variable that is the log of the original. Give the new variable a useful name.

Python examples

Using Python for data science is not far removed from using R.

Python’s main data processing library is pandas.

Filtering Rows
